¿Que estudiamos y que vimos?
A continuación
make_df('ABC', range(3))
| A | B | C | |
|---|---|---|---|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
df1
| A | B | |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
df2
| A | B | |
|---|---|---|
| 3 | A3 | B3 |
| 4 | A4 | B4 |
pd.concat([df1, df2])
| A | B | |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
Pandas preserva el indice. Suponiendo elcaso de dos datases con el mismo indice. 0-1
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index
display('x', 'y', 'pd.concat([x, y])')
x
| A | B | |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
| A | B | |
|---|---|---|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y])
| A | B | |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 0 | A2 | B2 |
| 1 | A3 | B3 |
Opciones:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
x
| A | B | |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
y
| A | B | |
|---|---|---|
| 0 | A2 | B2 |
| 1 | A3 | B3 |
pd.concat([x, y], ignore_index=True)
| A | B | |
|---|---|---|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
¿Que pasa cuando falta una columna?
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
df5
| A | B | C | |
|---|---|---|---|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |
df6
| B | C | D | |
|---|---|---|---|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |
pd.concat([df5, df6])
| A | B | C | D | |
|---|---|---|---|---|
| 1 | A1 | B1 | C1 | NaN |
| 2 | A2 | B2 | C2 | NaN |
| 3 | NaN | B3 | C3 | D3 |
| 4 | NaN | B4 | C4 | D4 |
sellers_df = pd.read_csv("data/olist_sellers_dataset.csv")
sellers_df
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |
| ... | ... | ... | ... | ... |
| 3090 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 3091 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 3092 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 3093 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 3094 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
3095 rows × 4 columns
Hago una muestra del dataset con cantidad = sample(n)
diez_datos = sellers_df.sample(10)
diez_datos
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 1327 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP |
| 1900 | 344223b2a90784f64136a8a5da012e7f | 89117 | gaspar | SC |
| 593 | 30829ded4523ab9224b93bc49a62c95f | 95360 | parai | RS |
| 2060 | a2b911e9ad40dc7d61c142a4683e2d85 | 27353 | barra mansa | RJ |
| 1574 | 2d9e8987a2ea5cf88dce330249043ff1 | 15155 | jaci | SP |
| 1130 | e3dd723429d1b23614b7fa85e2c7a853 | 9810 | sao bernardo do campo | SP |
| 2377 | eed78ac17f7f795a19a709745f00cd4e | 18608 | botucatu | SP |
| 790 | bd697db56ff8e7c9cd1abeb7f72029b2 | 3533 | sao paulo | SP |
| 1362 | 8a43128d7f9a3db592b866e6861a6cce | 8080 | sao paulo | SP |
| 2677 | 49f8d2ea7b718d0ce587d34207caf2b4 | 9830 | sao bernardo do campo | SP |
cinco_datos = sellers_df.tail(5)
cinco_datos
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 3090 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 3091 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 3092 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 3093 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 3094 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
pd.concat([diez_datos,cinco_datos])
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 1327 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP |
| 1900 | 344223b2a90784f64136a8a5da012e7f | 89117 | gaspar | SC |
| 593 | 30829ded4523ab9224b93bc49a62c95f | 95360 | parai | RS |
| 2060 | a2b911e9ad40dc7d61c142a4683e2d85 | 27353 | barra mansa | RJ |
| 1574 | 2d9e8987a2ea5cf88dce330249043ff1 | 15155 | jaci | SP |
| 1130 | e3dd723429d1b23614b7fa85e2c7a853 | 9810 | sao bernardo do campo | SP |
| 2377 | eed78ac17f7f795a19a709745f00cd4e | 18608 | botucatu | SP |
| 790 | bd697db56ff8e7c9cd1abeb7f72029b2 | 3533 | sao paulo | SP |
| 1362 | 8a43128d7f9a3db592b866e6861a6cce | 8080 | sao paulo | SP |
| 2677 | 49f8d2ea7b718d0ce587d34207caf2b4 | 9830 | sao bernardo do campo | SP |
| 3090 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 3091 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 3092 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 3093 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 3094 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
pd.concat([diez_datos,cinco_datos], ignore_index=True)
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 0 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP |
| 1 | 344223b2a90784f64136a8a5da012e7f | 89117 | gaspar | SC |
| 2 | 30829ded4523ab9224b93bc49a62c95f | 95360 | parai | RS |
| 3 | a2b911e9ad40dc7d61c142a4683e2d85 | 27353 | barra mansa | RJ |
| 4 | 2d9e8987a2ea5cf88dce330249043ff1 | 15155 | jaci | SP |
| 5 | e3dd723429d1b23614b7fa85e2c7a853 | 9810 | sao bernardo do campo | SP |
| 6 | eed78ac17f7f795a19a709745f00cd4e | 18608 | botucatu | SP |
| 7 | bd697db56ff8e7c9cd1abeb7f72029b2 | 3533 | sao paulo | SP |
| 8 | 8a43128d7f9a3db592b866e6861a6cce | 8080 | sao paulo | SP |
| 9 | 49f8d2ea7b718d0ce587d34207caf2b4 | 9830 | sao bernardo do campo | SP |
| 10 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 11 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 12 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 13 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 14 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
Es lo mismo hacer append en pandas
display('df1', 'df2', 'df1.append(df2)')
<string>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. <string>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1
| A | B | |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
df2
| A | B | |
|---|---|---|
| 3 | A3 | B3 |
| 4 | A4 | B4 |
df1.append(df2)
| A | B | |
|---|---|---|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |
quince = diez_datos.append(cinco_datos)
quince
/var/folders/hc/jwd08zzs6zg8d4ktgzqptl700000gp/T/ipykernel_5336/4097746820.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. quince = diez_datos.append(cinco_datos)
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 1327 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP |
| 1900 | 344223b2a90784f64136a8a5da012e7f | 89117 | gaspar | SC |
| 593 | 30829ded4523ab9224b93bc49a62c95f | 95360 | parai | RS |
| 2060 | a2b911e9ad40dc7d61c142a4683e2d85 | 27353 | barra mansa | RJ |
| 1574 | 2d9e8987a2ea5cf88dce330249043ff1 | 15155 | jaci | SP |
| 1130 | e3dd723429d1b23614b7fa85e2c7a853 | 9810 | sao bernardo do campo | SP |
| 2377 | eed78ac17f7f795a19a709745f00cd4e | 18608 | botucatu | SP |
| 790 | bd697db56ff8e7c9cd1abeb7f72029b2 | 3533 | sao paulo | SP |
| 1362 | 8a43128d7f9a3db592b866e6861a6cce | 8080 | sao paulo | SP |
| 2677 | 49f8d2ea7b718d0ce587d34207caf2b4 | 9830 | sao bernardo do campo | SP |
| 3090 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 3091 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 3092 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 3093 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 3094 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df1
| employee | group | |
|---|---|---|
| 0 | Bob | Accounting |
| 1 | Jake | Engineering |
| 2 | Lisa | Engineering |
| 3 | Sue | HR |
df2
| employee | hire_date | |
|---|---|---|
| 0 | Lisa | 2004 |
| 1 | Bob | 2008 |
| 2 | Jake | 2012 |
| 3 | Sue | 2014 |
Merge reconoce automaticamente que tenemos una columna de empleado y no necesita decirle con key mergear.
df3 = pd.merge(df1, df2)
df3
| employee | group | hire_date | |
|---|---|---|---|
| 0 | Bob | Accounting | 2008 |
| 1 | Jake | Engineering | 2012 |
| 2 | Lisa | Engineering | 2004 |
| 3 | Sue | HR | 2014 |
geo_df = pd.read_csv("data/olist_geolocation_dataset.csv")
geo_df
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|
| 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1046 | -23.546081 | -46.644820 | sao paulo | SP |
| 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 3 | 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 4 | 1035 | -23.541578 | -46.641607 | sao paulo | SP |
| ... | ... | ... | ... | ... | ... |
| 1000158 | 99950 | -28.068639 | -52.010705 | tapejara | RS |
| 1000159 | 99900 | -27.877125 | -52.224882 | getulio vargas | RS |
| 1000160 | 99950 | -28.071855 | -52.014716 | tapejara | RS |
| 1000161 | 99980 | -28.388932 | -51.846871 | david canabarro | RS |
| 1000162 | 99950 | -28.070104 | -52.018658 | tapejara | RS |
1000163 rows × 5 columns
quince
| seller_id | seller_zip_code_prefix | seller_city | seller_state | |
|---|---|---|---|---|
| 1327 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP |
| 1900 | 344223b2a90784f64136a8a5da012e7f | 89117 | gaspar | SC |
| 593 | 30829ded4523ab9224b93bc49a62c95f | 95360 | parai | RS |
| 2060 | a2b911e9ad40dc7d61c142a4683e2d85 | 27353 | barra mansa | RJ |
| 1574 | 2d9e8987a2ea5cf88dce330249043ff1 | 15155 | jaci | SP |
| 1130 | e3dd723429d1b23614b7fa85e2c7a853 | 9810 | sao bernardo do campo | SP |
| 2377 | eed78ac17f7f795a19a709745f00cd4e | 18608 | botucatu | SP |
| 790 | bd697db56ff8e7c9cd1abeb7f72029b2 | 3533 | sao paulo | SP |
| 1362 | 8a43128d7f9a3db592b866e6861a6cce | 8080 | sao paulo | SP |
| 2677 | 49f8d2ea7b718d0ce587d34207caf2b4 | 9830 | sao bernardo do campo | SP |
| 3090 | 98dddbc4601dd4443ca174359b237166 | 87111 | sarandi | PR |
| 3091 | f8201cab383e484733266d1906e2fdfa | 88137 | palhoca | SC |
| 3092 | 74871d19219c7d518d0090283e03c137 | 4650 | sao paulo | SP |
| 3093 | e603cf3fec55f8697c9059638d6c8eb5 | 96080 | pelotas | RS |
| 3094 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP |
¿Que columna elijo para mergear?
geo_df.columns
Index(['geolocation_zip_code_prefix', 'geolocation_lat', 'geolocation_lng',
'geolocation_city', 'geolocation_state'],
dtype='object')
pd.merge(quince,geo_df, how="left", left_on = "seller_zip_code_prefix", right_on="geolocation_zip_code_prefix" )
| seller_id | seller_zip_code_prefix | seller_city | seller_state | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.652816 | -46.599410 | sao paulo | SP |
| 1 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.649507 | -46.597288 | sao paulo | SP |
| 2 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.652020 | -46.598036 | são paulo | SP |
| 3 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.645393 | -46.593474 | sao paulo | SP |
| 4 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.647841 | -46.596295 | são paulo | SP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1178 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.011130 | -45.592347 | taubate | SP |
| 1179 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.013452 | -45.584299 | taubaté | SP |
| 1180 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.009155 | -45.592019 | taubate | SP |
| 1181 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.009019 | -45.584925 | taubate | SP |
| 1182 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.010990 | -45.586938 | taubaté | SP |
1183 rows × 9 columns
pd.merge(quince,geo_df, how="inner", left_on = "seller_zip_code_prefix", right_on="geolocation_zip_code_prefix" )
| seller_id | seller_zip_code_prefix | seller_city | seller_state | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.652816 | -46.599410 | sao paulo | SP |
| 1 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.649507 | -46.597288 | sao paulo | SP |
| 2 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.652020 | -46.598036 | são paulo | SP |
| 3 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.645393 | -46.593474 | sao paulo | SP |
| 4 | b18871c324516afa228a13e8f1ef0ae9 | 4186 | sao paulo | SP | 4186 | -23.647841 | -46.596295 | são paulo | SP |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1178 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.011130 | -45.592347 | taubate | SP |
| 1179 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.013452 | -45.584299 | taubaté | SP |
| 1180 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.009155 | -45.592019 | taubate | SP |
| 1181 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.009019 | -45.584925 | taubate | SP |
| 1182 | 9e25199f6ef7e7c347120ff175652c3b | 12051 | taubate | SP | 12051 | -23.010990 | -45.586938 | taubaté | SP |
1183 rows × 9 columns
list(quince.seller_zip_code_prefix.unique())
[4186, 89117, 95360, 27353, 15155, 9810, 18608, 3533, 8080, 9830, 87111, 88137, 4650, 96080, 12051]
list(geo_df.geolocation_zip_code_prefix.unique())
[1037, 1046, 1041, 1035, 1012, 1047, 1013, 1029, 1011, 1032, 1014, 1039, 1024, 1009, 1020, 1043, 1050, 1045, 1007, 1005, 1017, 1019, 1027, 1033, 1006, 1034, 1042, 1040, 1026, 1036, 1048, 1002, 1031, 1022, 1008, 1010, 1003, 1001, 1015, 1038, 1016, 1049, 1021, 1018, 1044, 1030, 1025, 1004, 1023, 1028, 1155, 1122, 1150, 1123, 1139, 1153, 1104, 1136, 1130, 1103, 1156, 1151, 1140, 1120, 1106, 1124, 1131, 1121, 1141, 1125, 1102, 1127, 1137, 1135, 1138, 1128, 1109, 1132, 1108, 1107, 1154, 1152, 1134, 1144, 1129, 1105, 1126, 1142, 1133, 1101, 1189, 1202, 1230, 1221, 1217, 1243, 1203, 1214, 1257, 1231, 1216, 1226, 1220, 1223, 1213, 1227, 1240, 1258, 1222, 1211, 1244, 1238, 1204, 1234, 1228, 1224, 1229, 1232, 1218, 1215, 1239, 1205, 1201, 1259, 1210, 1249, 1212, 1248, 1233, 1219, 1235, 1209, 1236, 1206, 1207, 1251, 1242, 1241, 1208, 1254, 1255, 1245, 1225, 1252, 1237, 1200, 1250, 1247, 1246, 1262, 1253, 1290, 1256, 1303, 1309, 1310, 1323, 1321, 1333, 1314, 1331, 1307, 1318, 1322, 1308, 1306, 1301, 1302, 1312, 1319, 1316, 1317, 1313, 1324, 1327, 1304, 1332, 1326, 1315, 1325, 1329, 1330, 1311, 1305, 1320, 1328, 1402, 1421, 1403, 1409, 1411, 1452, 1423, 1417, 1408, 1455, 1401, 1410, 1453, 1415, 1416, 1445, 1424, 1422, 1419, 1414, 1420, 1451, 1446, 1406, 1454, 1426, 1413, 1433, 1427, 1405, 1435, 1449, 1425, 1430, 1404, 1442, 1407, 1418, 1472, 1443, 1428, 1434, 1457, 1412, 1448, 1456, 1432, 1441, 1489, 1429, 1450, 1440, 1438, 1444, 1436, 1447, 1439, 1431, 1437, 1526, 1509, 1504, 1536, 1535, 1511, 1534, 1514, 1529, 1539, 1532, 1542, 1548, 1543, 1544, 1546, 1523, 1506, 1518, 1510, 1537, 1531, 1549, 1533, 1525, 1520, 1550, 1503, 1541, 1545, 1522, 1508, 1519, 1547, 1538, 1501, 1521, 1530, 1505, 1502, 1551, 1540, 1528, 1513, 1527, 1524, 1507, 1512, 1552, 1553, 1516, 1517, 2019, 2069, 2067, 2020, 2075, 2012, 2076, 2035, 2034, 2022, 2021, 2080, 2066, 2085, 2030, 2064, 2029, 2060, 2054, 2072, 2017, 2018, 2043, 2074, 2038, 2016, 2040, 2056, 2051, 2041, 2039, 2073, 2082, 2033, 2013, 2079, 2037, 2083, 2068, 2071, 2047, 2078, 2044, 2084, 2055, 2042, 2045, 2036, 2046, 2010, 2081, 2050, 2052, 2031, 2028, 2065, 2077, 2011, 2049, 2032, 2053, 2009, 2089, 2088, 2150, 2133, 2167, 2130, 2188, 2152, 2138, 2134, 2176, 2117, 2119, 2131, 2124, 2129, 2132, 2110, 2135, 2127, 2114, 2178, 2177, 2120, 2161, 2112, 2126, 2169, 2123, 2189, 2128, 2179, 2136, 2174, 2186, 2137, 2185, 2181, 2180, 2111, 2187, 2121, 2190, 2139, 2143, 2122, 2115, 2118, 2125, 2113, 2147, 2141, 2170, 2166, 2175, 2168, 2116, 2145, 2151, 2146, 2144, 2162, 2142, 2251, 2256, 2246, 2223, 2225, 2204, 2235, 2245, 2243, 2208, 2222, 2232, 2203, 2250, 2263, 2238, 2261, 2267, 2270, 2227, 2273, 2244, 2229, 2215, 2259, 2228, 2254, 2218, 2247, 2226, 2274, 2220, 2217, 2265, 2276, 2242, 2269, 2221, 2279, 2249, 2237, 2206, 2258, 2236, 2205, 2241, 2219, 2257, 2230, 2210, 2266, 2211, 2280, 2278, 2264, 2201, 2253, 2281, 2248, 2255, 2277, 2272, 2271, 2282, 2224, 2275, 2233, 2209, 2212, 2288, 2260, 2234, 2213, 2202, 2262, 2216, 2214, 2268, 2207, 2239, 2252, 2240, 2231, 2290, 2284, 2289, 2286, 2309, 2377, 2371, 2323, 2311, 2374, 2316, 2366, 2321, 2315, 2330, 2319, 2312, 2341, 2334, 2347, 2365, 2375, 2372, 2367, 2343, 2306, 2313, 2305, 2340, 2360, 2344, 2348, 2322, 2356, 2352, 2358, 2308, 2318, 2337, 2342, 2355, 2303, 2363, 2324, 2373, 2302, 2350, 2378, 2317, 2351, 2301, 2332, 2376, 2326, 2335, 2346, 2362, 2357, 2329, 2314, 2310, 2325, 2354, 2320, 2361, 2338, 2331, 2333, 2307, 2349, 2353, 2336, 2339, 2368, 2370, 2328, 2359, 2345, 2327, 2364, 2304, 2430, 2402, 2442, 2409, 2467, 2422, 2441, 2452, 2435, 2464, 2431, 2478, 2436, 2473, 2415, 2468, 2401, 2419, 2479, 2471, 2417, 2410, 2420, 2462, 2461, 2443, 2403, 2466, 2404, 2433, 2406, 2474, 2434, 2407, 2421, 2472, 2470, 2460, 2475, 2416, 2418, 2413, 2451, 2465, 2405, 2469, 2440, 2408, 2450, 2411, 2432, 2463, 2423, 2565, 2542, 2522, 2536, 2566, 2519, 2513, 2550, 2556, 2544, 2543, 2560, 2517, 2510, 2518, 2554, 2539, 2530, 2535, 2537, 2563, 2561, 2555, 2564, 2512, 2534, 2525, 2562, 2520, 2557, 2553, 2546, 2516, 2514, 2552, 2524, 2540, 2508, 2558, 2545, 2515, 2531, 2526, 2523, 2541, 2567, 2551, 2521, 2533, 2511, 2559, 2532, 2538, 2598, 2577, 2501, 2610, 2676, 2612, 2632, 2673, 2616, 2615, 2617, 2613, 2630, 2618, 2651, 2633, 2675, 2611, 2652, 2671, 2650, 2679, 2674, 2680, 2636, 2649, 2634, 2672, 2614, 2635, 2670, 2678, 2637, 2677, 2631, 2643, 2653, 2760, 2712, 2721, 2739, 2710, 2730, 2751, 2765, 2722, 2752, 2758, 2764, 2724, 2755, 2723, 2720, 2731, 2726, 2728, 2757, 2714, 2735, 2763, 2736, 2734, 2762, 2727, 2733, 2732, 2759, 2725, 2738, 2729, 2715, 2754, 2737, 2753, 2756, 2716, 2717, 2713, 2701, 2750, 2855, 2840, 2874, 2866, 2801, 2842, 2879, 2848, 2809, 2814, 2841, 2878, 2871, 2881, 2815, 2859, 2820, 2834, 2861, 2811, 2882, 2804, 2865, 2835, 2837, 2847, 2839, 2850, 2831, 2812, 2813, 2873, 2863, 2849, 2860, 2846, 2877, 2856, 2883, 2816, 2845, 2867, 2810, 2802, 2817, 2806, 2862, 2870, 2805, 2880, 2864, 2875, 2851, 2872, 2853, 2844, 2803, 2854, 2808, 2807, 2858, 2832, 2836, 2857, 2833, 2888, 2829, 2876, 2992, 2960, 2976, 2987, 2995, 2996, 2952, 2921, 2917, 2969, 2935, 2991, 2951, 2914, 2989, 2945, 2998, 2957, 2920, 2936, 2977, 2932, 2910, 2919, 2962, 2975, 2929, 2965, 2927, 2918, 2942, 2984, 2964, 2913, 2937, 2950, 2966, 2912, 2986, 2930, 2968, 2961, 2983, 2990, 2911, 2982, 2967, 2941, 2926, 2993, 2939, 2928, 2915, 2943, 2933, 2940, 2931, 2972, 2925, 2934, 2956, 2963, 2985, 2978, 2938, 2944, 2916, 2924, 2959, 2979, 2997, 2994, 2970, 2955, 2971, 2981, 2909, 2908, 2988, 3088, 3012, 3080, 3084, 3060, 3025, 3081, 3023, 3021, 3018, 3090, 3043, 3029, 3071, 3064, 3057, 3053, 3089, 3015, 3066, 3009, 3065, 3077, 3041, 3059, 3020, 3062, 3063, 3052, 3061, 3086, 3034, 3022, 3058, 3087, 3073, 3003, 3024, 3004, 3030, 3045, 3069, 3001, 3047, 3006, 3035, 3067, 3019, 3040, 3027, 3011, 3050, 3085, 3008, 3042, 3016, 3013, 3070, 3002, 3031, 3051, 3068, 3072, 3010, 3014, 3028, 3054, 3033, 3055, 3005, 3017, 3032, 3026, 3007, 3076, 3044, 3036, 3046, 3074, 3165, 3166, ...]
geo_df.geolocation_zip_code_prefix.nunique()
19015
lista_unicos_geo = list(geo_df.geolocation_zip_code_prefix.unique())
quince.seller_zip_code_prefix.isin(lista_unicos_geo).sum()
15
quince.seller_zip_code_prefix.isin(lista_unicos_geo)
1327 True 1900 True 593 True 2060 True 1574 True 1130 True 2377 True 790 True 1362 True 2677 True 3090 True 3091 True 3092 True 3093 True 3094 True Name: seller_zip_code_prefix, dtype: bool
unicos_sellers = list(quince.seller_zip_code_prefix.unique())
geo_df.geolocation_zip_code_prefix.isin(unicos_sellers).sum()
1183
df_filtrado = geo_df[geo_df.geolocation_zip_code_prefix.isin(unicos_sellers)]
df_filtrado
| geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|---|
| 58652 | 3533 | -23.543315 | -46.521247 | sao paulo | SP |
| 58659 | 3533 | -23.546867 | -46.511526 | sao paulo | SP |
| 59031 | 3533 | -23.542838 | -46.522099 | sao paulo | SP |
| 59072 | 3533 | -23.545301 | -46.514192 | são paulo | SP |
| 59137 | 3533 | -23.542007 | -46.523187 | sao paulo | SP |
| ... | ... | ... | ... | ... | ... |
| 977477 | 96080 | -31.753613 | -52.315717 | pelotas | RS |
| 977494 | 96080 | -31.747875 | -52.327321 | pelotas | RS |
| 977528 | 96080 | -31.753442 | -52.315974 | pelotas | RS |
| 977570 | 96080 | -31.749108 | -52.321976 | pelotas | RS |
| 977579 | 96080 | -31.748636 | -52.310751 | pelotas | RS |
1183 rows × 5 columns
df_filtrado.groupby("geolocation_zip_code_prefix").agg('nunique')
| geolocation_lat | geolocation_lng | geolocation_city | geolocation_state | |
|---|---|---|---|---|
| geolocation_zip_code_prefix | ||||
| 3533 | 34 | 34 | 2 | 1 |
| 4186 | 34 | 34 | 2 | 1 |
| 4650 | 42 | 42 | 2 | 1 |
| 8080 | 141 | 140 | 2 | 1 |
| 9810 | 127 | 128 | 2 | 1 |
| 9830 | 37 | 37 | 2 | 1 |
| 12051 | 64 | 64 | 2 | 1 |
| 15155 | 27 | 27 | 1 | 1 |
| 18608 | 162 | 163 | 1 | 1 |
| 27353 | 21 | 21 | 1 | 1 |
| 87111 | 62 | 62 | 1 | 1 |
| 88137 | 27 | 27 | 2 | 1 |
| 89117 | 16 | 16 | 1 | 1 |
| 95360 | 28 | 28 | 2 | 1 |
| 96080 | 84 | 84 | 1 | 1 |
No solo se pueden repetir datos tmb se pueden borrar en los joins
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
pd.merge(df6, df7)
| name | food | drink | |
|---|---|---|---|
| 0 | Mary | bread | wine |
pd.merge(df6, df7, how='inner')
| name | food | drink | |
|---|---|---|---|
| 0 | Mary | bread | wine |
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
pd.merge(df6, df7, how='left')
| name | food | drink | |
|---|---|---|---|
| 0 | Peter | fish | NaN |
| 1 | Paul | beans | NaN |
| 2 | Mary | bread | wine |
display('df7', 'df6', "pd.merge(df7, df6, how='left')")
df7
| name | drink | |
|---|---|---|
| 0 | Mary | wine |
| 1 | Joseph | beer |
df6
| name | food | |
|---|---|---|
| 0 | Peter | fish |
| 1 | Paul | beans |
| 2 | Mary | bread |
pd.merge(df7, df6, how='left')
| name | drink | food | |
|---|---|---|---|
| 0 | Mary | wine | bread |
| 1 | Joseph | beer | NaN |
args and kwargs
def funcion_generica(parametro_obligatorio, *args, **kwargs):
"""This is the function docstring"""
print(parametro_obligatorio)
if args:
print(args)
if kwargs:
print(kwargs)
funcion_generica(1, 2, 3, variable_no_obligatoria = 5)
1
(2, 3)
{'variable_no_obligatoria': 5}
a_list = [1,2,3]
a_dict = {'bar': 5}
funcion_generica(*a_list, **a_dict)
1
(2, 3)
{'bar': 5}
class Student:
school = 'UCEMA'
def __init__(self, name, age):
self.name = name
self.age = age
alice = Student('Alicia', 20)
alice.__dict__
{'name': 'Alicia', 'age': 20}
d = {'a': 1, 'b': 2}
d.update(c=3)
d
{'a': 1, 'b': 2, 'c': 3}
class Student:
school = 'UCEMA'
def __init__(self, name, age, **kwargs):
self.name = name
self.age = age
self.__dict__.update(**kwargs)
alice = Student('Alicia', 20, nationality='francesa')
alice.nationality
'francesa'
alice.__dict__
{'name': 'Alicia', 'age': 20, 'nationality': 'francesa'}
Herencia!
class DataStudent(Student):
course = 'data'
#se agrega un atributo en la clase hija
def __init__(self, name, age, año, **kwargs):
super().__init__(name, age, **kwargs)
self.año = año
DataStudent('Alicia', 20, nationality='francesa').__dict__
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Input In [66], in <cell line: 1>() ----> 1 DataStudent('Alicia', 20, nationality='francesa').__dict__ TypeError: __init__() missing 1 required positional argument: 'año'
DataStudent('Alicia', 20, nationality='francesa', año="2022").__dict__
{'name': 'Alicia', 'age': 20, 'nationality': 'francesa', 'año': '2022'}
!pip install plotly
Collecting plotly
Downloading plotly-5.10.0-py2.py3-none-any.whl (15.2 MB)
|████████████████████████████████| 15.2 MB 11.5 MB/s
Collecting tenacity>=6.2.0
Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.10.0 tenacity-8.1.0
WARNING: You are using pip version 21.3.1; however, version 22.3 is available.
You should consider upgrading via the '/Users/federicomoreno/PycharmProjects/UCEMA/venv/bin/python -m pip install --upgrade pip' command.
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show();
Scikit-learn (Sklearn) es una biblioteca de aprendizaje automático que proporciona herramientas de preprocesamiento de datos, modelado y selección de modelos.
👉 https://scikit-learn.org
Varias opciones
!pip install -U scikit-learn
!conda install scikit-learn
Ir al interpreter de pycharm
👉 Documentacion de instalación https://scikit-learn.org/stable/install.html
module¶linear_model es un moduloLinearRegression es una clase👉 https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
Hay muchas formas de importar módulos y clases en cuadernos, pero hay una mejor práctica.
🚫 import sklearn # importación de toda la biblioteca
model = sklearn.linear_model.LnearRegression() # debe escribir la biblioteca y el prefijo del módulo cada vez
🚫 import sklearn.linear_model # importación de todo el módulo
model = linear_model.LinearRegression() # debe escribir el prefijo del módulo cada vez
🚫 from sklearn import linear_model # importación de todo el módulo
model = linear_model.LinearRegression() # debe escribir el prefijo del módulo cada vez
🚫 desde sklearn.linear_model import * # importación de todo el módulo
modelo = Regresión Lineal()
import this
The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
"Explicit is better than implicit"
✅ from sklearn.linear_model import LinearRegression # explicit class import from module
model = LinearRegression() #=> we know where this object comes from

Una regresión lineal (OLS) mapea una relación lineal entre la entrada X y la salida y. Optimiza la pendiente a y la intersección b al reducir los residuos entre la y real y la y pronosticada.

import pandas as pd
data = pd.read_csv("house.csv")
data.head()
| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 86 columns
Comencemos de manera simple modelando SalePrice (y) de acuerdo con GrLivArea (X).
livecode_data = data[['GrLivArea','SalePrice']]
livecode_data.head()
| GrLivArea | SalePrice | |
|---|---|---|
| 0 | 1710 | 208500 |
| 1 | 1262 | 181500 |
| 2 | 1786 | 223500 |
| 3 | 1717 | 140000 |
| 4 | 2198 | 250000 |
import matplotlib.pyplot as plt
# Plot Living area vs Sale price
plt.scatter(data['GrLivArea'], data['SalePrice'])
# Labels
plt.xlabel("Living area")
plt.ylabel("Sale price")
plt.show()
Entrenando una regresion linal con LinearRegression
# Import the model
from sklearn.linear_model import LinearRegression
# Instanciate the model (💡 in Sklearn often called "estimator")
model = LinearRegression()
# Define X and y
X = data[['GrLivArea']]
y = data['SalePrice']
# Train the model on the data
model.fit(X, y)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
En esta etapa, el modelo ha aprendido los parámetros óptimos (pendiente a e intersección b) necesarios para mapear la relación entre X e y.
# View the model's slope (a)
model.coef_
array([105.00927564])
# View the model's intercept (b)
model.intercept_
22104.12101002081
Cada algoritmo de Scikit-learn tiene una métrica de puntuación predeterminada.
LinearRegression utiliza el Coeficiente de determinación (R2) por defecto.
R2 representa la proporción de la varianza del objetivo explicada por las características. La puntuación suele variar entre 0 y 1 Cuanto mayor sea la puntuación, mejor será el modelo.
# Evaluate the model's performance
model.score(X,y)
0.48960426399689116
# Predict on new data
model.predict([[1000]])
/Users/federicomoreno/PycharmProjects/UCEMA/venv/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
array([127113.39664561])
Predice un departamento de 1000 pies cuadrados y lo estima en casi 125mil dolares

from sklearn.model_selection import train_test_split
# split the data into train and test
train_data, test_data = train_test_split(livecode_data, test_size=0.3)
# Ready X's and y's
X_train = train_data[['GrLivArea']]
y_train = train_data['SalePrice']
X_test = test_data[['GrLivArea']]
y_test = test_data['SalePrice']
# Ready X and y
X = livecode_data[['GrLivArea']]
y = livecode_data['SalePrice']
# Split into Train/Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Instantiate the model
model = LinearRegression()
# Train the model on the Training data
model.fit(X_train, y_train)
# Score the model on the Test data
model.score(X_test,y_test)
0.5384589652653409
